fix: PIL Image Memory Leaks in Dataset Builders#194
Conversation
|
✅ DCO Check Passed Thanks @samiuc, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: a592458 I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: ac622cc I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: 3648bd9 Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: a592458 I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: ac622cc I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: 3648bd9 I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: c05b85e Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: a592458 I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: ac622cc I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: 3648bd9 I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: c05b85e I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: 36044e5 Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
|
@samiuc I would be interested in merging the memory-leaking related fixes, which are definitely needed. However, the PR also addresses unrelated things that I would like to consider on separate PRs:
Could you separate the other things out into new PRs please? Thanks. |
|
@cau-git Yes, I do plan to split this into multiple PRs. The main reason everything is currently in a single PR is that internally we build a wheel and test it in our test environment, and it’s alot faster to test multiple fixes together on the same branch. Once the testing is complete, I’ll open separate PRs. Thanks for taking a look, I’ll follow up shortly with the new PRs. |
Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: a592458 I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: ac622cc I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: 3648bd9 I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: c05b85e I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: 36044e5 I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: b83a1f0 Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: a592458 I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: ac622cc I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: 3648bd9 I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: c05b85e I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: 36044e5 I, samiuc <sami.ullah.chat@gmail.com>, hereby add my Signed-off-by to this commit: b83a1f0 Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
… into sami/fix-memory-leaks
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
…oject/docling-eval into sami/fix-memory-leaks
Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
|
@cau-git Can you please review this PR when you get some time? Thank you |
cau-git
left a comment
There was a problem hiding this comment.
Confirmed working, good to merge! Thanks.
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
During our testing, we found that the memory grows unbounded when processing large datasets, eventually causing OOM errors caused by the PIL Image objects opened during dataset iteration are never closed. Images opened via Image.open() in builders and decoded via Features_Image().decode_example() when loading from parquet accumulate in memory.
So the code changes in this PR add an auto-close PIL Images in as_record_dict() after serialization. Added _close_images()
method to DatasetRecord and DatasetRecordWithPrediction that releases image resources. Also added try/finally blocks in builders (pixparse, funsd, xfund, file_dataset) to close images after local processing.